.TH FLEX 1 "20 June 1989" "Version 2.1"
.SH NAME
flex - fast lexical analyzer generator
.SH SYNOPSIS
.B flex
[
.B -bdfipstvFILT -c[efmF] -Sskeleton_file
] [
.I filename
]
.SH DESCRIPTION
.I flex
is a rewrite of
.I lex
intended to right some of that tool's deficiencies: in particular,
.I flex
generates lexical analyzers much faster, and the analyzers use
smaller tables and run faster.
.SH OPTIONS
In addition to lex's
.B -t
flag, flex has the following options:
.TP
.B -b
Generate backtracking information to
.I lex.backtrack.
This is a list of scanner states which require backtracking
and the input characters on which they do so. By adding rules one
can remove backtracking states. If all backtracking states
are eliminated and
.B -f
or
.B -F
is used, the generated scanner will run faster (see the
.B -p
flag). Only users who wish to squeeze every last cycle out of their
scanners need worry about this option.
.TP
.B -d
makes the generated scanner run in
.I debug
mode. Whenever a pattern is recognized the scanner will
write to
.I stderr
a line of the form:
.nf
--accepting rule #n
.fi
Rules are numbered sequentially with the first one being 1. Rule #0
is executed when the scanner backtracks; Rule #(n+1) (where
.I n
is the number of rules) indicates the default action; Rule #(n+2) indicates
that the input buffer is empty and needs to be refilled and then the scan
restarted. Rules beyond (n+2) are end-of-file actions.
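.IP
As a reading aid, the numbering scheme above can be written as a small C
helper. This is only a sketch: the function name describe_rule and the
strings it returns are made up for illustration and are not part of flex.
.nf
```c
/*
 * Interpret the number printed in a "--accepting rule #n" debug line.
 * num_rules is the number of rules in the flex input file.
 */
const char *describe_rule( int reported, int num_rules )
{
    if ( reported < 0 )
        return "invalid";
    if ( reported == 0 )
        return "backtracking";
    if ( reported <= num_rules )
        return "user rule";
    if ( reported == num_rules + 1 )
        return "default action";
    if ( reported == num_rules + 2 )
        return "buffer refill";
    return "end-of-file action";
}
```
.fi
For a scanner with 10 rules, describe_rule( 11, 10 ) yields "default action".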
.TP
.B -f
has the same effect as lex's -f flag (do not compress the scanner
tables); the mnemonic changes from
.I fast compilation
to (take your pick)
.I full table
or
.I fast scanner.
The actual compilation takes
.I longer,
since flex is I/O bound writing out the big table.
.IP
This option is equivalent to
.B -cf
(see below).
.TP
.B -i
instructs flex to generate a
.I case-insensitive
scanner. The case of letters given in the flex input patterns will
be ignored, and the rules will be matched regardless of case. The
matched text given in
.I yytext
will have its case preserved (i.e., it will not be folded).
.TP
.B -p
generates a performance report to stderr. The report
consists of comments regarding features of the flex input file
which will cause a loss of performance in the resulting scanner.
Note that the use of
.I REJECT
and variable trailing context (see
.B BUGS)
entails a substantial performance penalty; use of
.I yymore(),
the
.B ^
operator,
and the
.B -I
flag entail minor performance penalties.
.TP
.B -s
causes the
.I default rule
(that unmatched scanner input is echoed to
.I stdout)
to be suppressed. If the scanner encounters input that does not
match any of its rules, it aborts with an error. This option is
useful for finding holes in a scanner's rule set.
.TP
.B -v
has the same meaning as for lex (print to
.I stderr
a summary of statistics of the generated scanner). Many more statistics
are printed, though, and the summary spans several lines. Most
of the statistics are meaningless to the casual flex user, but the
first line identifies the version of flex, which is useful for figuring
out where you stand with respect to patches and new releases.
.TP
.B -F
specifies that the
.ul
fast
scanner table representation should be used. This representation is
about as fast as the full table representation
.ul
(-f),
and for some sets of patterns will be considerably smaller (and for
others, larger). In general, if the pattern set contains both "keywords"
and a catch-all, "identifier" rule, such as in the set:
.nf
"case" return ( TOK_CASE );
"switch" return ( TOK_SWITCH );
...
"default" return ( TOK_DEFAULT );
[a-z]+ return ( TOK_ID );
.fi
then you're better off using the full table representation. If only
the "identifier" rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
.ul
-F.
.IP
This option is equivalent to
.B -cF
(see below).
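.IP
A minimal sketch of the "hash table or some such" approach, in C. It uses
a binary search over a sorted table rather than a hash table, and the
names classify_identifier, struct keyword, and the token values are
assumptions made for this example only.
.nf
```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real scanner gets these from its parser. */
enum { TOK_ID = 256, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword { const char *name; int token; };

/* Must stay sorted by name, because bsearch() is used below. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH },
};

static int compare_keyword( const void *key, const void *entry )
{
    return strcmp( (const char *) key,
                   ((const struct keyword *) entry)->name );
}

/* Return the keyword's token code, or TOK_ID for any other identifier. */
int classify_identifier( const char *text )
{
    const struct keyword *kw = bsearch( text, keywords,
        sizeof( keywords ) / sizeof( keywords[0] ),
        sizeof( keywords[0] ), compare_keyword );

    return kw ? kw->token : TOK_ID;
}
```
.fi
With such a helper, the pattern set shrinks to the single "identifier"
rule, whose action returns classify_identifier( yytext ).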
.TP
.B -I
instructs flex to generate an
.I interactive
scanner. Normally, scanners generated by flex always look ahead one
character before deciding that a rule has been matched. At the cost of
some scanning overhead, flex will generate a scanner which only looks ahead
when needed. Such scanners are called
.I interactive
because if you want to write a scanner for an interactive system such as a
command shell, you will probably want the user's input to be terminated
with a newline, and without
.B -I
the user will have to type a character in addition to the newline in order
to have the newline recognized. This leads to dreadful interactive
performance.
.IP
If all this seems too confusing, here's the general rule: if a human will
be typing in input to your scanner, use
.B -I,
otherwise don't; if you don't care about how fast your scanners run and
don't want to make any assumptions about the input to your scanner,
always use
.B -I.
.IP
Note,
.B -I
cannot be used in conjunction with
.I full
or
.I fast tables,
i.e., the
.B -f, -F, -cf,
or
.B -cF
flags.
.TP
.B -L
instructs flex to not generate
.B #line
directives (see below).
.TP
.B -T
makes flex run in
.I trace
mode. It will generate a lot of messages to stdout concerning
the form of the input and the resultant non-deterministic and deterministic
finite automata. This option is mostly for use in maintaining flex.
.TP
.B -c[efmF]
controls the degree of table compression.
.B -ce
directs flex to construct
.I equivalence classes,
i.e., sets of characters
which have identical lexical properties (for example, if the only
appearance of digits in the flex input is in the character class
"[0-9]" then the digits '0', '1', ..., '9' will all be put
in the same equivalence class).
.B -cf
specifies that the
.I full
scanner tables should be generated - flex should not compress the
tables by taking advantage of similar transition functions for
different states.
.B -cF
specifies that the alternate fast scanner representation (described
above under the
.B -F
flag)
should be used.
.B -cm
directs flex to construct
.I meta-equivalence classes,
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together.
A lone
.B -c
specifies that the scanner tables should be compressed but neither
equivalence classes nor meta-equivalence classes should be used.
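.IP
The idea behind equivalence classes can be illustrated with a short C
sketch. It is not flex's own algorithm: it assumes 7-bit ASCII, takes each
character class as a string of its members, and treats two characters as
equivalent when they belong to exactly the same classes. The name
build_equivalence_classes is hypothetical.
.nf
```c
#include <string.h>

#define NCHARS 128      /* 7-bit ASCII, an assumption of this sketch */

/*
 * sets[i] lists the members of the i-th character class used in the
 * patterns (at most 32 sets, so a signature fits in an unsigned long).
 * On return, ec[c] holds the class number (starting at 1) of character c.
 * The return value is the number of classes found.
 */
int build_equivalence_classes( const char *sets[], int nsets, int ec[NCHARS] )
{
    unsigned long signature[NCHARS];
    int c, i, prev, nclasses = 0;

    /* Record, as a bit mask, which sets each character belongs to. */
    for ( c = 0; c < NCHARS; ++c )
        {
        signature[c] = 0;
        for ( i = 0; i < nsets; ++i )
            if ( c != 0 && strchr( sets[i], c ) )
                signature[c] |= 1UL << i;
        }

    for ( c = 0; c < NCHARS; ++c )
        {
        ec[c] = 0;
        /* Reuse the class of an earlier character with the same mask. */
        for ( prev = 0; prev < c; ++prev )
            if ( signature[prev] == signature[c] )
                {
                ec[c] = ec[prev];
                break;
                }
        if ( ec[c] == 0 )
            ec[c] = ++nclasses;
        }

    return nclasses;
}
```
.fi
Given only the class "[0-9]", the sketch finds two classes: the ten digits
and everything else.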
.IP
The options
.B -cf
or
.B -cF
and
.B -cm
do not make sense together - there is no opportunity for meta-equivalence
classes if the table is not being compressed. Otherwise the options
may be freely mixed.
.IP
The default setting is
.B -cem
which specifies that flex should generate equivalence classes
and meta-equivalence classes. This setting provides the highest
degree of table compression. You can trade larger tables for
faster-executing scanners, with
the following generally being true:
.nf
    slowest & smallest
        -cem
        -ce
        -cm
        -c
        -c{f,F}e
        -c{f,F}
    fastest & largest
.fi
Note that scanners with the smallest tables compile the quickest, so
during development you will usually want to use the default, maximal
compression.
.TP
.B -Sskeleton_file
overrides the default skeleton file from which flex constructs
its scanners. You'll never need this option unless you are doing
flex maintenance or development.
.SH INCOMPATIBILITIES WITH LEX
.I flex
is fully compatible with
.I lex
with the following exceptions:
.IP -
There is no run-time library to link with. You needn't
specify
.I -ll
when linking, and you must supply a main program. (Hacker's note: since
the lex library contains a main() which simply calls yylex(), you actually
.I can
be lazy and not supply your own main program and link with
.I -ll.)
.IP -
lex's
.B %r
(Ratfor scanners) and
.B %t
(translation table) options
are not supported.
.IP -
The do-nothing
.ul
-n
flag is not supported.
.IP -
When definitions are expanded, flex encloses them in parentheses.
With lex, the following
.nf
NAME [A-Z][A-Z0-9]*
%%
foo{NAME}? printf( "Found it\\n" );
%%
.fi
will not match the string "foo" because when the macro
is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
and the precedence is such that the '?' is associated with
"[A-Z0-9]*". With flex, the rule will be expanded to
"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
Note that because of this, the
.B ^, $, <s>,
and
.B /
operators cannot be used in a definition.
.IP -
The undocumented lex-scanner internal variable
.B yylineno
is not supported.
.IP -
The
.B input()
routine is not redefinable, though it may be called to read characters
following whatever has been matched by a rule. If
.B input()
encounters an end-of-file the normal
.B yywrap()
processing is done. A ``real'' end-of-file is returned as
.I EOF.
.IP
Input can be controlled by redefining the
.B YY_INPUT
macro.
YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
action is to place up to max_size characters in the character buffer "buf"
and return in the integer variable "result" either the
number of characters read or the constant YY_NULL (0 on Unix
systems) to indicate EOF. The default YY_INPUT reads from the
file-pointer "yyin" (which is by default
.I stdin),
so if you
just want to change the input file, you needn't redefine
YY_INPUT - just point yyin at the input file.
.IP
A sample redefinition of YY_INPUT (in the first section of the input
file):
.nf
%{
#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) \\
    { \\
    int c = getchar(); /* int, so EOF stays distinct from any char */ \\
    result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
    }
%}
.fi
You can also add in things like keeping track of the
input line number this way; but don't expect your scanner to
go very fast.
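.IP
As a further illustration, here is a sketch of a YY_INPUT that fills the
buffer from an in-memory string rather than one character at a time. The
helper read_from_string and the variables input_text and input_pos are
hypothetical names; YY_NULL is defined here only so the fragment compiles
on its own, and in a real flex input file the code would sit inside %{ and
%} in the first section.
.nf
```c
#include <string.h>

#define YY_NULL 0   /* end-of-file indicator, 0 on Unix systems */

/* Hypothetical input source: the scanner reads from this string. */
static const char *input_text = "";
static size_t input_pos = 0;

/* Copy up to max_size characters into buf; return the count or YY_NULL. */
static int read_from_string( char *buf, int max_size )
{
    size_t left = strlen( input_text ) - input_pos;
    size_t n = left < (size_t) max_size ? left : (size_t) max_size;

    memcpy( buf, input_text + input_pos, n );
    input_pos += n;
    return n > 0 ? (int) n : YY_NULL;
}

#undef YY_INPUT
#define YY_INPUT(buf,result,max_size) result = read_from_string( buf, max_size )
```
.fi
Because whole blocks are copied at once, this variant avoids the
per-character overhead of the getchar() sample.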
.IP -
.B output()
is not supported.
Output from the ECHO macro is done to the file-pointer
"yyout" (default
.I stdout).
.IP -
If you are providing your own yywrap() routine, you must "#undef yywrap"
first.
.IP -
To refer to yytext outside of your scanner source file, use
"extern char *yytext;" rather than "extern char yytext[];".
.IP -
.B yyleng
is a macro and not a variable, and hence cannot be accessed outside
of the scanner source file.
.IP -
flex reads only one input file, while lex's input is made
up of the concatenation of its input files.
.IP -
The name
.B
FLEX_SCANNER
is #define'd so scanners may be written for use with either
flex or lex.
.IP -
The macro
.B
YY_USER_ACTION
can be redefined to provide an action
which is always executed prior to the matched rule's action. For example,
it could be #define'd to call a routine to convert yytext to lower-case,
or to copy yyleng to a global variable to make it accessible outside of
the scanner source file.
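.IP
A sketch of the second use, copying yyleng to a global variable. The
variable name last_match_length is an assumption; in a flex input file the
fragment would go inside %{ and %} in the first section.
.nf
```c
/* Readable from other source files via "extern int last_match_length;". */
int last_match_length = 0;

/* Runs before every matched rule's action and records the match length. */
#undef YY_USER_ACTION
#define YY_USER_ACTION last_match_length = yyleng;
```
.fi
Other source files can then read last_match_length after each call to the
scanner, even though yyleng itself is not visible to them.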
.IP -
In the generated scanner, rules are separated using
.B
YY_BREAK
instead of simple "break" statements. This allows, for example, C++ users to
#define YY_BREAK to do nothing (while being very careful that every
rule ends with a "break" or a "return"!) to avoid suffering from
unreachable statement warnings where a rule's action ends with "return".
.SH ENHANCEMENTS
.IP -
.I Exclusive start-conditions
can be declared by using
.B %x
instead of
.B %s.
These start-conditions have the property that when they are active,
.I no other rules are active.
Thus a set of rules governed by the same exclusive start condition
describes a scanner which is independent of any of the other rules in
the flex input. This feature makes it easy to specify "mini-scanners"
which scan portions of the input that are syntactically different
from the rest (e.g., comments).
.IP -
.I yyterminate()
can be used in lieu of a return statement in an action. It terminates
the scanner and returns a 0 to the scanner's caller, indicating "all done".
.IP -
.I End-of-file rules.
The special rule "<<EOF>>" indicates
actions which are to be taken when an end-of-file is
encountered and yywrap() returns non-zero (i.e., indicates
no further files to process). The action can either
point yyin at a new file to process, in which case the
action should finish with
.I YY_NEW_FILE
(this is a branch, so subsequent code in the action won't
be executed), or it should finish with a
.I return
statement. <<EOF>> rules may not be used with other
patterns; they may only be qualified with a list of start
conditions. If an unqualified <<EOF>> rule is given, it
applies only to the INITIAL start condition, and
.I not
to
.B %s
start conditions.
These rules are useful for catching things like unclosed comments.
An example:
.nf
%x quote
%%
...
<quote><<EOF>> {
error( "unterminated quote" );
yyterminate();
}
<<EOF>> {
yyin = fopen( next_file, "r" );
YY_NEW_FILE;
}
.fi
.IP -
flex dynamically resizes its internal tables, so directives like "%a 3000"
are not needed when specifying large scanners.
.IP -
The scanning routine generated by flex is declared using the macro
.B YY_DECL.
By redefining this macro you can change the routine's name and
its calling sequence. For example, you could use:
.nf
#undef YY_DECL
#define YY_DECL float lexscan( a, b ) float a, b;
.fi
to give it the name
.I lexscan,
returning a float, and taking two floats as arguments. Note that
if you give arguments to the scanning routine, you must terminate
the definition with a semi-colon (;).
.IP -
flex generates
.B #line
directives mapping lines in the output to
their origin in the input file.
.IP -
You can put multiple actions on the same line, separated with
semi-colons. With lex, the following
.nf
foo handle_foo(); return 1;
.fi
is truncated to
.nf
foo handle_foo();
.fi
flex does not truncate the action. Actions that are not enclosed in
braces are terminated at the end of the line.
.IP -
Actions can be begun with
.B %{
and terminated with
.B %}.
In this case, flex does not count braces to figure out where the
action ends - actions are terminated by the closing
.B %}.
This feature is useful when the enclosed action has extraneous
braces in it (usually in comments or inside inactive #ifdef's)
that throw off the brace-count.
.IP -
All of the scanner actions (e.g.,
.B ECHO, yywrap ...)
except the
.B unput()
and
.B input()
routines,
are written as macros, so they can be redefined if necessary
without requiring a separate library to link to.
.IP -
When
.B yywrap()
indicates that the scanner is done processing (it does this by returning
non-zero), on subsequent calls the scanner will always immediately return
a value of 0. To restart it on a new input file, the action
.B yyrestart()
is used. It takes one argument, the new input file. It closes the
previous yyin (unless stdin) and sets up the scanner's internal variables
so that the next call to yylex() will start scanning the new file. This
functionality is useful for, e.g., programs which will process a file, do some
work, and then get a message to parse another file.
.IP -
Flex scans the code in section 1 (inside %{}'s) and the actions for
occurrences of
.I REJECT
and
.I yymore().
If it doesn't see any, it assumes the features are not used and generates
higher-performance scanners. Flex tries to be correct in identifying
uses but can be fooled (for example, if a reference is made in a macro from
a #include file). If this happens (a feature is used and flex didn't
realize it) you will get a compile-time error of the form
.nf
reject_used_but_not_detected undefined
.fi
You can tell flex that a feature is used even if it doesn't think so
with
.B %used
followed by the name of the feature (for example, "%used REJECT");
similarly, you can specify that a feature is
.I not
used even though it thinks it is with
.B %unused.
.IP -
Comments may be put in the first section of the input by preceding
them with '#'.
.SH FILES
.TP
.I flex.skel
skeleton scanner
.TP
.I lex.yy.c
generated scanner (called
.I lexyy.c
on some systems).
.TP
.I lex.backtrack
backtracking information for
.B -b
flag (called
.I lex.bck
on some systems).
.SH "SEE ALSO"
.LP
lex(1)
.LP
M. E. Lesk and E. Schmidt,
.I LEX - Lexical Analyzer Generator
.SH AUTHOR
Vern Paxson, with the help of many ideas and much inspiration from
Van Jacobson. Original version by Jef Poskanzer. Fast table
representation is a partial implementation of a design done by Van
Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
.LP
Thanks to the many flex beta-testers and feedbackers, especially Casey
Leedom, Frederic Brehm, Nick Christopher, Chris Faylor, Eric Goldman, Eric
Hughes, Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Esmond Pitt,
Jef Poskanzer, and Dave Tallman. Thanks to Keith Bostic, John Gilmore, Bob
Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
headaches.
.LP
Send comments to:
.nf
Vern Paxson
Real Time Systems
Bldg. 46A
Lawrence Berkeley Laboratory
1 Cyclotron Rd.
Berkeley, CA 94720
(415) 486-6411
vern@csam.lbl.gov
vern@rtsg.ee.lbl.gov
ucbvax!csam.lbl.gov!vern
.fi
I will be gone from mid-July '89 through mid-August '89. From August on,
the addresses are:
.nf
vern@cs.cornell.edu
Vern Paxson
CS Department
Grad Office
4126 Upson
Cornell University
Ithaca, NY 14853-7501
<no phone number yet>
.fi
Email sent to the former addresses should continue to be forwarded for
quite a while. Also, it looks like my username will be "paxson" and
not "vern". I'm planning on having a mail alias set up so "vern" will
still work, but if you encounter problems try "paxson".
.SH DIAGNOSTICS
.LP
.I flex scanner jammed -
a scanner compiled with
.B -s
has encountered an input string which wasn't matched by
any of its rules.
.LP
.I flex input buffer overflowed -
a scanner rule matched a string long enough to overflow the
scanner's internal input buffer (16K bytes - controlled by
.B YY_BUF_MAX
in "flex.skel").
.LP
.I old-style lex command ignored -
the flex input contains a lex command (e.g., "%n 1000") which
is being ignored.
.SH BUGS
.LP
Some trailing context
patterns cannot be properly matched and generate
warning messages ("Dangerous trailing context"). These are
patterns where the ending of the
first part of the rule matches the beginning of the second
part, such as "zx*/xy*", where the 'x*' matches the 'x' at
the beginning of the trailing context. (Lex doesn't get these
patterns right either.)
If desperate, you can use
.B yyless()
to effect arbitrary trailing context.
.LP
.I variable
trailing context (where both the leading and trailing parts do not have
a fixed length) entails the same performance loss as
.I REJECT
(i.e., substantial).
.LP
For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the abovementioned performance loss.
In particular, parts using '|' or {n} are always considered variable-length.
.LP
Use of unput() or input() trashes the current yytext and yyleng.
.LP
Use of unput() to push back more text than was matched can
result in the pushed-back text matching a beginning-of-line ('^')
rule even though it didn't come at the beginning of the line.
.LP
yytext and yyleng cannot be modified within a flex action.
.LP
Nulls are not allowed in flex inputs or in the inputs to
scanners generated by flex. Their presence generates fatal
errors.
.LP
Flex does not generate correct #line directives for code internal
to the scanner; thus, bugs in
.I
flex.skel
yield bogus line numbers.
.LP
Pushing back definitions enclosed in ()'s can result in nasty,
difficult-to-understand problems like:
.nf
{DIG} [0-9] /* a digit */
.fi
in which the pushed-back text is "([0-9] /* a digit */)".
.LP
Due to both buffering of input and read-ahead, you cannot intermix
calls to stdio routines, such as, for example,
.B getchar()
with flex rules and expect it to work. Call
.B input()
instead.
.LP
The total number of table entries listed by the
.B -v
flag excludes the table entries needed to determine
what rule has been matched. The number of entries is equal
to the number of DFA states if the scanner does not use REJECT,
and somewhat greater than the number of states if it does.
.LP
To be consistent with ANSI C, the escape sequence \\xhh should
be recognized for hexadecimal escape sequences, such as '\\x41' for 'A'.
.LP
It would be useful if flex wrote to lex.yy.c a summary of the flags used in
its generation (such as which table compression options).
.LP
The scanner run-time speeds still have not been optimized as much
as they deserve. Van Jacobson's work shows that they can go
faster still.
.LP
The utility needs more complete documentation.